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Abstract 



We study generalized density-based clustering in which sharply defined clusters such as clusters on 
lower dimensional manifolds are allowed. We show that accurate clustering is possible even in high 
dimensions. We propose two data-based methods for choosing the bandwidth and we study the stability 
properties of density clusters. We show that a simple graph-based algorithm successfully approximates 
the high density clusters. 

1 Introduction 

It has been observed that classification methods can be very accurate in high dimensional problems, ap- 
parently contradicting the curse of dimensionality. A plausible explanation for this phenomenon is the 
"low-noise" condition described, for instance, in Mammen and Tsybakov (1999). When the low noise con- 
dition holds, the probability mass near the decision boundary is small and fast rates of convergence of the 
classification error are possible in high dimensions. 

Similarly, clustering methods can be very accurate in high dimensional problems. For example, clustering 
subjects based on gene profiles and clustering curves are both high dimensional problems where several 
methods have worked well despite the high dimensionality. This suggests that it may be possible to find 
conditions that explains the success of clustering in high dimensional problems. 

In this paper we focus on clusters that are defined as the connected components of high density regions 
(Cuevas and Fraiman, 1997; Hartigan, 1975). The advantage of density clustering over other methods is 
that there is a well-defined population quantity being estimated and density clustering allows the shape of 
the clusters to be very general. (A related but somewhat different approach for generally shaped clusters 
is spectral clustering; see (von Luxburg, 2007) and (Ng et al., 2002).) Of course, without some conditions, 
density estimation is subject to the usual curse of dimensionality. One would hope that an appropriate 
low noise condition would obviate the curse of dimensionality. Such assumptions have been proposed by 
Polonik (1995), Rigollet (2007), Rigollet and Vert (2006), and others. However, the assumptions used by 
these authors rule out the case where the clusters are very sharply defined, which should be the easiest cases, 
and, more generally, clusters defined on lower dimensional sets. 

The purpose of this paper is to define a notion of density clusters that does not rule out the most favorable 
cases and is not limited to sets of full dimension. We study the risk properties of density-based clustering 
and its stability properties, and we provide data-based methods for choosing the smoothing parameters. 

The following simple example helps to illustrate our motivation. We refer the reader to the next section 
for a more rigorous introduction. Suppose that a distribution P is a mixture of finitely many point masses 
at distinct points xi, . . . ,Xk where Xj G R d . Specifically, suppose that P = k^ 1 X)j=i $j where Sj is a point 
mass at Xj. The clusters are C\ — {^i}, ■ ■ ■ , — {xk}- This is a trivial clustering problem even if the 
dimension d is very high. The clusters could not be more sharply defined yet the density does not even exist 
in the usual sense. This makes it clear that common assumptions about the density such as smoothness or 
even boundedness are not well-suited for density clustering. 

Now let ph = dPh/dfi be the Lebesgue density of the measure Ph obtained by convolving P with the 
probability measure having Lebesgue density Kh, a kernel with bandwidth h. Unlike the original distribution 
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P, Ph has full-dimensional support for each positive h. The "mollified" density ph contains all the information 
needed for clustering. Indeed, there exist constants h > and A > such that the following facts are true: 

1. for all < h < h, the level set {x : Ph(x) > A} has disjoint, connected components C{ 1 , . . . , C£; 

2. the components Cj 1 contain the true clusters: Cj C Cj 1 for j = 1, . . . , k; 

3. although Cj 1 overestimates the true cluster Cj, this overestimation is inconsequential since P(Cj , —Cj) — 
and hence a new observation will not be misclustered; 

4. let ph denote the kernel density estimator using Kh with fixed bandwidth < h < h and based on a 
i.i.d. sample of size n from P. Then, sup^ \ph{x) —ph(x)\ — 0(y/logn/n) almost everywhere P, which 
does not depend on the dimension d (see Section 3.1). The bias from using a fixed bandwidth h — 
which does not vanish as n — > oo — does not adversely affect the clustering. 

In summary, we can recover the true clusters using an estimator of the density ph with a large bandwidth 
h. It is not necessary to assume that the true density is smooth or that it even exists. 
Our contributions in this paper are the following: 

1 . We develop a notion of density clustering that applies to probability distributions that have non-smooth 
Lebesgue densities or do not even admit a density. 

2. We find the rates of convergence for estimators of these clusters. 

3. We study two data-driven methods for choosing the bandwidth. 

4. We study the stability properties of density clusters. 

5. We show that the depth-first search algorithm on the p-nearest neighborhood graph of {ph > A} is 
effective at recovering the high-density clusters. 

Another approach to clustering that does not require densities is the minimum volume set approach 
(Polonik (1995), Scott and Nowak (2006)). Our approach is different because we are specifically trying to 
capture the idea that kernel density estimates are useful for clustering even when the density may not exist. 

Section 2 contains notation and definitions. Section 3 contains results on rates of convergence. We give a 
data-driven method for choosing the bandwidth in Section 4. Section 4.2 contains results on cluster stability. 
The validity of the graph-based algorithm for approximating the clusters is proved in Section 5. Section 6 
contains some examples based on simulated data. Concluding remarks are in Section 7. All proofs are in 
the Section 9. Some technical details are in Appendix A. 

Notation. For two sequences {a n } and {b n }, we write a n = 0(b n ) and a n = £l(b n ) if there exists a 
constant C > such that, for all n large enough, \a n \/b n < C and \a n \/b n > C, respectively. If a n = fl(b n ) 
and a n — 0(b n ), then we will write a n x b n . We denote with ¥(E) the probability of a generic event E, 
whenever the underlying probability measure is implicitly understood from the context. By the dimension 
of a Euclidean set we will always mean the fc-dimensional Hausdorff dimension for some integer < k < d 
(see Appendix A). These sets may consist, for example, of smooth submanifolds or even single points. 

2 Settings and Assumptions 
2.1 Level Set Clusters 

In this section we develop a probabilistic framework for the definition of clusters we have adopted. For ease 
of readability, the more technical measure-theoretic details are given in Appendix A. 
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Let P be a probability distribution on M. d whose support 5 (the smallest closed set of P- measure 1) is 
comprised of an unknown number m of disjoint compact sets {Si, . . . , S m } of different dimensions. We define 
the geometric density of P as the measurable function p: R d i-> K given by 

^) = lim^%M, (1) 

where B{x, e) is the Euclidean ball of radius h centered at x, ji is the <i-dimensional Lebesgue measure and 
Vd = fJ,(B(0, 1)). Note that, almost everywhere P, p(x) = oo if and only if x belongs to some set Si having 
dimension strictly less than d and is positive and finite if and only if x belongs to some d-dimcnsional set 
Si. In general, J Rd p(x)d[j,(x) < 1 and, therefore, p is not necessarily a probability density. Nonetheless, p 
can be used to recover the support of P, since 



S={x: p{x) > 0}, 



where for a set A C R d , A denotes its closure. 
For A > , define the A-level set 



L = L{\) = {x: p(x)>\}, (2) 

and its boundary dL{\) = {x: p(x) = A}. Throughout the paper, we will suppose that we are given a 
fixed value of A < ||p||oo, where ||p||oo = sup x&R dp(x). Often, A is chosen so that P(L(X)) w 1 — a for some 
given a. In practice, it is advisable to present the results for a variety of values of A as we discuss in Section 7. 



We assume that there are k > 1 disjoint, compact, connected sets Ci(A), . . . , Ck(X) such that 

L = C 1 (A)U---UC fc (A). 

We will often write Cj instead of of Cj(X) when the dependence of A is clear from the context. The value 
of k is not assumed to be known. The sets Ci, . . . , Ck are called the X-clusters of p, or just clusters. In our 
setting, the Cj's need not be full dimensional. Indeed, Cj might be a lower-dimensional manifold or even a 
single point. Furthermore, if Si has dimension smaller than d, then Cj = Si, for some j = 1, . . . , k. Thus, 
for any A > 0, the A-clusters of p will include all the lower-dimensional components of S. On the other hand, 
if Si is full-dimensional, then there may be multiple clusters in it, depending on the value of A. 

We observe an i.i.d. sample X = (X\, . . . ,X n ) from P, from which we construct the kernel density 
estimator 

where Cd = J Rd K(x)dfi(x). For simplicity, we assume that the kernel K: R d i— ► K + is a symmetric, bounded, 
smooth function supported on the Euclidean unit ball. These assumptions can be easily relaxed to include, 
for instance, the case of regular kernels as defined in Devroye et al. (1997, Chapter 10). In particular, while 
the compactness of the support of K simplifies our analysis, it is not essential and could be replace by 
assuming fast decaying tails. Further conditions on the kernel K are discussed in Section 3.1. 
Let ph : K. i— > K be the measurable function given by 

Ph (x) - / K h (x - y)dP(y) = E(p h {x)), (4) 
Js 

where K h (x) = (^^j^^j ■ Also, let K h [i be the probability measure given by K h [i{A) = J A K h (x)dfi(x), 

for any Borel set A C R d . Then, ph is the Lebesgue density of the probability measure Ph obtained by 
convolving P with if^/i. More precisely, for each measurable set A, 

Ph(A)= [ [ K h (x - y)dP(y)dii(x) = [ p h {x)d»(x). 

JAJS J A 
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Borrowing some terminology from analysis, where the kernel K is referred to as a mollifier, we call the 
measure Ph and the density ph as the mollified measure and mollified density, respectively. For each h, the 
mollification of P by K yields that 

1. the mollified measure Ph has full-dimensional support S B(0,h) and is absolutely continuous with 
respect to /x; here, for two set A and B in K d , A(B B = {x + y, : x £ A,y £ B} denotes its Minkowski 
sum; 

2. the mollified density ph is of class C a whenever K is of class C a , with a £ N+ U {oo}. (A real valued 
function is of class C a if its partial derivatives up to order a exist and are continuous.) 

(As a referee pointed out to us, the properties of mollified measures are related to the classical theory 
of distributions.) Mollifying P makes it better behaved. At the same time, Ph and ph can be seen as 
approximations of the original measure P and the geometric density p, respectively, in a sense made precise 
by the following result. 

Lemma 2.1. As h — ► 0, Ph converges weakly to P and Yvaih—>a Ph{x) = p(x), almost everywhere P. 

To estimate the A-clusters of p, we use the connected components of L, i.e. the A-clusters of ph- That is, 
we estimate L with 

L = L h {X) = [x: p h (x)>\}. (5) 

In practice, finding the estimated clusters is computationally difficult. Indeed, to verify that two points 
xi and X2 are in the same cluster, we need to find at least one path 7 C M. d connecting them such that 
Ph{x) > A for each x £ 7. Conversely, when X\ and X2 do not belong to the same cluster, this property has 
to be shown to fail for every possible path between them. We discuss an algorithm for approximating the 
clusters in Section 5. Until then, we ignore the computational problems and assume that the A-clusters of 
Ph can be computed exactly. 

2.2 Risk 

We consider two different risk functions. 

• The level set risk is defined to be R L (p,ph) = K(p(p,ph, P)), where 

p(r,q,P)= f dP(x), (6) 

J{x: r(x)>\}A{x: q(x)>\} 

and AAB = (An B c ) U (A c n B) is the symmetric set difference. 

• Define the excess mass functional as 

£(A) = P(A) - A M (A), (7) 

for any measurable set A C K d . This functional is maximized by the true level set L; see Mueller and 
Sawitzki (1991) and Polonik (1995). We can use the excess mass functional as a risk function except, 
of course, that we maximize it rather than minimize it. Given an estimate L of L based on ph, we will 
then be interested in making the excess mass risk 

R M ( P ,Ph)=£(L)-E(£(L)) (8) 

as small as possible. Furthermore, if P has full-dimensional support, simple algebra reveals that 
maximizing £(A) is equivalent to minimizing, 

/ \P~ A|d/i 

J AAL 
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which is the loss function used by Willctt and Nowak (2007). In this case the minimizer L is unique. 
More generally, if P — Pq + Pi where Po is the part of P that is absolutely continuous with respect to 
the Lebesgue measure, then 

£(L)-£(A)= f \p -X\dn+P 1 (L)-P 1 (A), (9) 
Jaal 

where po = ^j-. It is clear that L is no longer the unique minimizer of the excess mass functional. 

2.3 Assumptions 

Throughout our analysis we assume the following conditions. 
(CI) There exist positive constants 7, C\ and e such that 

V\\p{X)-X\ <e) <Cie\ Ve€[0,e). 

(C2) There exist a positive constant h, and a permutation a of {1, . . . , k} such that, for all h G (0, h) and 
all A' € (A - e, A + e), 

k 

L h (\i)=\JC*(\>), 

where 

(a) Cf(A') n C$(X)' = for 1 < i < j < k; 

(b) Q(A')C Confer all 1 <j<k. 

(C3) There exist a positive constant Ci such that, for all h € (0, h) and A' G (A — e, A], 

fc 

i(A') = IJ C i( X ')> 

where 

n(dCj(X') © B(0,/i)) < C 2 /i (d - rf ' )vl , 
and is the dimension of the component Si of the support of P such that Cj(A') C Sj. 

2.4 Remarks on the Assumptions 

Conditions of the form (CI) or of other equivalent forms, are also known as low noise condition or margin 
conditions in the classification literature. They have appeared in many places, such as Tsybakov (1997), 
Mammcn and Tsybakov (1999), Bai'llo ct al. (2001), Tsybakov (2004), Steinwart et al. (2005), Cuevas et al. 
(2006), Cadre (2006), Audibert and Tsybakov (2007), Castro and Nowak (2008) and Singh et al. (2009). 

This condition, first introduced in Polonik (1995), provides a way to relate the stochastic fluctuations of 
Ph around its mean p^ to the clustering risk. Indeed, the larger 7, the smaller the effects of these fluctuations, 
and the easier it is to obtain good clusters from noisy estimates of p^, for any h < h. 

Conditions (C2) simply require that the level set of the mollified density include the true clusters. The 
additional fringe — L can be viewed as a form of clustering bias. Though mild and reasonable, these 
asssumptions are particularly important, as they imply that the estimated density ph can be used quite 
effectively for clustering purposes, for a range of bandwidth values. This is is shown in the next simple 
result. Let N(X), Ni l (X) and N^X) denote the number of A-clusters for p, p^ and p^, respectively. Notice 
that we do not require p to satisfy any smoothness properties. See Section 3.3 for the case case of smooth 
densities. 
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Lemma 2.2. Under conditions ( C2) and for all e € (0, e) and h £ (0, h), on the event £h n — {\\ph ~ P/i||oo < 
e}, 

N h (X) = jV^(A) = k. 

Conditions (C3) is used to obtain establish rates of convergence for the level set risk and the excess 
mass risk. It provides a way of quantifying the clustering bias due to the use of the mollified density ph as 
a function of the bandwidth h, locally in a neighborhood of A. In fact, if conditions (C2) hold, then the 
clustering bias is due to the sets Lh(\ — e) — L(A — e), for h £ (0, h) and e £ [0, e). 

Lemma 2.3. Under conditions (C2) and (C3), for all h G (0, h) and e £ [0, e) such that A — e > 0, 

p(L h (X~e)-L(X-e)) < C 2 h e , (10) 

where 

d — maxi di + 1 if max^ di > 
d otherwise, 

and, for some positive constant C3, 

P(L h (\-e)-L(\-e))<C 3 ht, (11) 
where £ is either 00 or 1; in particular, £ = 1 only i/max^ di = d. 

Condition (C3) is rather mild and depends only on dimension of the support of P. Indeed, it follows 
from the rectifiability property (see Appendix A for details) that if, Si is a component of the support of P of 
dimension di < d, then 5,; has box-counting dimension di, (see, e.g., Ambrosio et al., 2000, Theorem 2.104). 
This implies that (C3) is satisfied for all h small enough (see also Falconer, 2003). For clusters Cj belonging 
to full-dimensional components of the support of P, (C3) follows if the sets dCj(X') have box-counting 
dimension d — 1 for all A' G (A — e, A] and if h is small enough. In fact, under these additional assumptions, 
it is possible to show that the bounds in Lemma 2.3 are sharp in the sense that, for all A' £ (A — e, A] and 
h £ (0, h), p, (Lh{\') — L(\')) = il(h e ). In addition, provided that h is smaller than the minimal inter-cluster 
distance 

min inf \\x — y\\, 
i^j xeCjveCj 

we also obtain that P(Lh(\ — e) — L(X — e)) — fi(/i>), where £ can only be 1 or 00. 

Finally, we point out that the value of h depends on the curvature of the components dL{\ — e) for all 
e £ [0,e), and on the minimal inter-cluster distance. The smaller the condition numbers (see, e.g., Niyogi 
ct al., 2008) of these components, and the larger the inter-cluster distance, the larger h. 

Although the rates are not affected by the constants, in practice, they can have a significant effect on the 
results, since they may very well depend on d. This is especially true of C\, as illustrated in Example 2.7 
below. 



2.5 A Refined Analysis of Condition (CI) 

We conclude this section with some comments on the parameter 7 appearing in condition (CI), whose value 
affects in a crucial way the consistency rates, with faster rates arising from larger values of 7. If S has 
dimension smaller than d, then, clearly, 7 = 00, thus throughout this subsection we assume that P is a 
probability measure on M. d having Lebesgue density p. 

First, a fairly general sufficient condition for assumption (CI) to hold with 7 = 1 at A can be easily 
obtained using probabilistic arguments as follows. Let G denote the distribution of the random variable 
Y = p(X) and suppose G has a Lebesgue density g which is bounded away from and infinity on (A— e, A+e). 
Then, by the mean value theorem, for any non-negative e < e, 

P(A - e < p(X) < A + e) = G({y: y £ (A + e, A — e)}) = eg(X + V ), 



G 



for some r\ £ (— e, e). Thus, (CI) holds with 7 — 1 at A. See also Example 2.7 below. A more refined result 
based on analytic conditions is given next. Below 'H d ~ 1 denotes the (d — l)-dimensional Hausdorff measure 
in K d . See Appendix A for the definition of Hausdorff measure. 

Lemma 2.4. Suppose that P is a probability measure onM. d having Lipschiz density p. Assume that, almost 
everywhere fi, ||Vp(x)|| > and that 7i ^({x: p(x) — A}) < 00 for any A £ (0, ||p||oo)- Then, (CI) holds 
with 7=1 for each A £ (0, ||p||oo) except for a set of Lebesgue measure 0. 

A further point of interest is to characterize the set of A values for which condition (CI) holds with 7^1. 
Clearly, if p has a jump discontinuity, then (CI) holds with 7 = 00, for all values of A in some interval. On 
the other hand, due to the previous result, if ||Vp|| is bounded away from and 00 in a neighborhood of 
p _1 (A), then 7=1. Thus one could expect a value of 7 different than 1 when Vp does not exist or when 
||Vp|| is infinity or vanishes in p _1 (A). See the example on page 7 in Rigollct and Vert (2006), where (CI) 
holds with 7 < 1 if q > d and 7 > 1 if q < d, the former case corresponding to ||Vp(xo)|| = and the latter 
to lim. x —> Xo ||Vp(x)|| = 00. However, this would seem to indicate that, if p is sufficiently regular, the values 
of A for which 7 ^ 1 form a negligible set of M. Lemma 2.4 above already shows that this set has Lebesgue 
measure zero if p is Lipschitz with non-vanishing gradient. Under stronger assumptions, it can be verified 
that this set is in fact finite. 

Corollary 2.5. Under the assumption of Lemma 2.4, if p is of class C 1 and has compact support, then the 
set of A such that (CI) holds with 7 7^ 1 is finite. 

Example 2.6. Sharp Clusters and Lower-dimensional Clusters. Suppose that p = ^ = YliLi ^jPj where 
Pi is a density with support on a compact, connected set Si, 7Tj = 1 and min^ ixi > 0. Moreover suppose 
that 

min inf llx — yll > 

&j xed.yec, 

where d(A, B) = iitf x eA,yeB \\x — y\\. Finally suppose that 

min inf itjp(x) > A. 

We denote clusters of this type as sharp clusters. See Singh et al. (2009), for example. It is easy to 
see that (CI) and (11) hold with 7 = £ = 00. A more general example in which one of the mixture 
component is supported on a lower dimensional set is shown in Figure 1. Here, the true distribution is 
P = (l/3)Unif(-5.5, -4.5) + (l/3)Unif (4.5, 5.5) + (l/3)<5 . The geometric density and the mollified density 
based on h = .04 are shown in the top plot. The point mass at is indicated with a vertical bar. The bottom 
plot shows the true clusters and the mollified clusters based on ph with A = .04. The clusters based on ph 
contain the true clusters and the difference between them is a set of zero probability. 

Example 2.7. Normal Distributions. Suppose that X ~ A^(0,S), with E positive definite. Set a = \T,\ 1 / 2 . 
Then, (CI) holds for any < A < [a (V27r) J with 7=1 and C\ = Cdla (V27r) , where the constant d 

depends on d (and, of course, A). We prove the claim only for A = a [a (\/27r) d ^ , where a £ (0, 1). Cases 

in which a = 1 or a = can be dealt with similarly. Let W ~ Xd an d notice that X T Y,~ 1 X = W. For all 
e > smaller than 



min 




simple algebra yields 

P(\MX)-\\<e) = p(21og 1— . 7 <^<21og 1 d ) 

= 2 ( log 1 1 7¥a* ~ log I ) /^\ d ) Pd ( log _u h^Y> ) ' 
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Mollified clusters 
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True clusters 
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Figure 1: Sharp clusters. Top: the density of P = (l/3)Unif (-5.5, -4.5) + (l/3)Unif (4.5, 5.5) + (l/3)<5 and 
the mollified density ph for h — .04. The point mass at is indicated with a vertical bar. Bottom: the true 
clusters and the mollified clusters of pt with A = .04. 
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Figure 2: Noise exponent for Gaussians. Each curve shows P(\p(X) — A| < e) versus e for a = 1/2. The plots 
are nearly linear since 7 = 1 in this case. 

for some 77 € (— e, e) where denotes the density of a distribution and the second equality holds in virtue 
of the mean value theorem. By a first order Taylor expansion, for e J, 0, the first term on the right hand side 
of the previous display can be written as 



(Vto) d 1 ~ d + \ ~ d + (6 2 ). 

\a — ea (y/2ir) a + ea (v2tt) / 



2ea 



Since : 1 d + — — 1 d p d log — — 1 d x 1 for any e > bounded by (12), the claim is 

y a— ecrf v27rj Q+ecrfv27rJ y \ q+jjctI \/27r I J 

proved. See Figure 2. 



3 Rates of Convergence 

In this section we study the rates of convergence in the two distances using deterministic bandwidths. We 
defer the discussion of random (data driven) bandwidths until Section 4. 

3.1 Preliminaries 

Before establishing consistency rates for the different risk measures described above, we discuss some neces- 
sary preliminaries. 

In our analysis we require the event 

£h,e = {\\Ph-Ph\\oo<e}, ee(0,e),h€{0,h), (13) 

to hold with high probability, for all n large enough. In fact, some control over provides a means of 
bounding the clustering risks, as shown in the following result. 
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Lemma 3.1. Let e £ (0,e) and h £ (0, h) be such that the conditions (CI) and (C2) are satisfied. Then, on 
the event £h,e, 

L(X + e) C L h {\) C L(X + e) U A U £ 

where 

A = L(X - e) - L{\ + e) 

and 

B = L h (X-e)-L(X-e). 
Therefore, on £h,o under the additional condition (C3), 

P (X(A)AL(A)) < C x e< + C 2 hZ. (14) 

In order to bound P(£ f ) , we study the properties of the kernel estimator ph- We will impose the following 
condition on the kernel K. 



(VC) The class of functions 



x — 



T= { K [ — — ) ,x £ R d ,h > 



satisfies, for some positive number A and v 

supN(T h ,L 2 (P),e\\F\\ L2{P) )<(^j , (15) 

where N(T,d,e) denotes the e-covering number of the metric space (T,d), F is the envelope function 
of T and the supremum is taken over the set of all probability measures on M. d . The quantities A and 
v are called the VC characteristics of T. 

Assumption (VC) appears in Gine and Guillou (2002), Einmahl and Mason (2005) and Ginc and Koltchinskii 
(2006). It holds for a large class of kernels, including for example, any compact supported polynomial kernel 
and the Gaussian kernel. See Nolan and Pollard (1987) and van der Vaart and Wcllncr (1996) for sufficient 
conditions for (VC). 

Using condition (VC), we can establish the following finite sample bound for P(||ph — Ph\\oo > e), which 
is obtained as a direct application of results in Ginc and Guillou (2002). 

Proposition 3.2 (Gine and Guillon). Assume that the kernel satisfies the property (VC) and that 

sup sup / Kl(t - x)dP(x) < D < oo. (16) 

teM d h>0 JR d 

1. Let h be fixed. Then, there exist constants L > and C > 0, which depend only on the VC character- 
istics of K , such that, for any C\>C and < e < jj^j— , there exists an uq > 0, which depends on e, 
D, ||if||oo an d the VC characteristics of K, such that, for all n > uq, 

. . . , 1 T f 1 log(l + d/UL)) nh d e 2 ) 

sup \p h (x)- Ph (x)\>2e\<Lexp\- T — ^^^7" }■ ( 17 ) 

2. Let h n — > as n oo in such a way that j t "^ d | — > oo. If {e n } is a sequence such that 



'log r n 



^ = fl # ' (18) 



where r n = Cl (h n d ^ 2 ^j, then, for all n large enough, (17) holds with h and e replaced by h n and e n , 
respectively. In particular, the term on the right hand side of (17) vanishes at the rate O (Tn)- 
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The above theorem imposes minimal assumptions on the kernel K and, more importantly, on the prob- 
ability distribution P, whose density is not required to be bounded or smooth, and, in fact, may not even 
exist. Condition (16) is automatically satisfied by bounded kernels. Finally, we remark that, for fixed h, 

setting e n = y ^~fp for an appropriate constant Ck (depending on K) , an application of the Borel-Cantelli 



Lemma yields that, as n — > oo, \\ph — ph\\oa = O -^p- j almost everywhere P. 
3.2 Rates of Convergence 

We now derive the converge rates for the clustering risks defined in Section 2.2. Below we will write Ck 
for a constant whose value depends only on the VC characteristic of the kernel K and on the constant D 
appearing in (16). 

We recall that Lemma 2.3 provides a way of controlling the clustering bias due to the sets Lh(X — e) — 
L(X — e), uniformly over e < e and h < h. In fact, the parameters 9 £ {l,...,d} and £ S {l,oo} will 
determine the rates of consistency for the excees mass and the level set risk, respectively. Specifically, higher 
values of the parameter 6 which correspond to supports of lower dimension yield faster convergence rates for 
the excess mass risk. As for the level set risks, the case £ — oo is the most favorable, since it implies that 
the clustering bias has no effect on the estimation of level sets and dimension independent rates are possible. 
In particular, if Cj has dimension smaller than d, then P{C%,j\ — Q) — 0, so that £ = oo. More generally, 
£ = oo occurs when L = S. Overall our results yield that, as expected, better rates for the clustering risk 
are obtained for distributions supported on lower-dimensional sets. 

Theorem 3.3. (Level Set Risk.) Suppose that (CI), (C2), (C3) and (VC) hold. Then, there exists a 
constant Cl such that, for any h S (0, h) and e S (0, e), 



R L 



(p,Ph) < C L (f> +h£ + e -°« nhd ^ . (19) 



In particular, setting 



we obtain 



,'logn\ 7/(2C+d7) / logn 

tin = and e„ — 



C K nhi 



R^j=oLJ(^y +i \ i -\\. (20) 



n I n 



If 7 — oo, then either S — L is empty or has zero Lebesgue measure, or S — L is a full dimensional set of 
positive Lebesgue measure. The former cases, which correspond to P having a lower dimensional support or 
to sharp clusters (see Example 2.6), implies that R L — O (^). Thus we have dimension independent rates 

for sharp clusters. In the latter case, £ = 1, so that R L is of order O ({^^) J ■ When 7 < 00, then £ = 1 



and the risk is of order O 



In practice, there are examples in between the sharp and non-sharp cases for probability distributions 
with full-dimensional support. For example, if there is a very small amount of mass just outside the cluster, 
then, technically, £ = 1 and the rate will be slow for large d. However, if this mass is very small then we 
expect for finite samples that the behavior of the risk will be close to the behavior observed in the sharp 
case. We could capture this idea mathematically by allowing P to change with n and then allowing to 
vary with n and take values between 1 and 00. However, we shall not pursue the details here. 

As an interesting corollary to Theorem 3.3, we can show that the expected proportion of sample points 
that are incorrectly assigned as clusters or noise vanishes at the same rate. 
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Corollary 3.4. Let f h = where 

I h = {« : sign(p /l (X i ) - A) ^ sign(p(JQ) - A)}. 
Then, E(f h ) < C L U* + ftf + e-^"' 1 ^ 2 ) . 

We now turn to the excess mass risk. 

Theorem 3.5. (Excess Mass.) Suppose that (Cl), (C2), (C3) and (VC) hold. Then, there exists a constant 
Cm, independent of e and h, such that, for any h € (0,h) and e € (0,e) with e < X, 

R M (p,Ph) < Cm + K° + e- c ^ hd ) . 

Thus, setting 

/lognA ss+V+d / logn 

ft„ = and e„ 



C K nht 



we obtain 



9(7 + 1) 



'log n 



R (p,Ph) = \ [ — ^— (21) 



When 7 = oo the excess mass risk i? M is of order O I I . Thus, the higher 8, i.e. the smaller the 
dimension of the support of P, the faster the rate of convergence. In particular, if P is supported over a 
finite set of points the risk vanishes at the dimension independent rate O ( " J . When 7 < 00, then 9=1 

/ log n \ 2+d( 7 +i) 



and the risk is of order O 



3.3 Some Special Cases 

Here we discuss some interesting special cases. 

Fast Rates For Biased Clusters. In some cases, we might be content with estimating the level set i/j(A), 
which is a biased version of L(X). That is, the fringe Lh(X) — L(X) may not be of great practical concern 
and, in fact, it may contain a very small amount of mass. Indeed, we believe this is why clustering is often so 
successful in high dimensional problems. Exact estimation of the level sets is not necessary in many practical 
problems. In fact, by Lemma 2.2, conditions (C2) guarantees that L will include L with high probability. 
Thus, for clustering purposes, one may consider some modifications of our risk functions. First, suppose we 
only require that the estimated clusters cover the true clusters. That is, we say there is not error as long as 
Cj C Cj. This suggests the following modification of our risk functions: 

• R L (p,Ph) = f{ x . p ( x )>x}n{x: p h {x)<\} dP{x), 

• R M (p,p h )=£(L)-E(L h r\L). 

Then, we have the following result, which gives faster, dimension independent rates. The proof is similar to 
the proofs of the previous results and is omitted. 



Theorem 3.6. Let h e (0, h) be fixed. Under (C1),(C2) and (VC), then 

R L (p,p h ) = O max 



log n \ 1 



n / n 
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and 

s»wu- <>((!==)*). 

Alternatively, one may be only interested in estimating the clusters of the mollified density p^ , for any 
fixed h £ (0,h). Then, provided that p^ is sufficiently smooth (which is guaranteed by choosing a smooth 
kernel) and has finite positive gradient for each point in the set dLh(X), the results in Section 2.5 show that, 
for all e small enough, 

fi({x: \ Ph (x) - A| < e}) < e. 

Thus, under assumptions (C2) and (VC), similar arguments to the ones used in the proofs of Theorems 3.3 
and 3.5 imply that 



R L (p h ,Ph)= I dP(x) = 

J{x: p h (x)>\}A{x: p h (x)>\} 

and 

R M (Ph,Ph) = £(L h ) - E (Z{L H )) = O 
In either case, we get dimension independent rates. 

The Smooth Full- Dimensional Case. In the more specialized settings in which P has full-dimensional support 
and the Lebesgue density p is smooth, better results are possible. For example, using the same settings of 
Rigollet and Vert (2006), if p is /3-times Holder differentiable, then the bias conditions (C2) are superfluous, 
as 

Ibfc -Plloo < Ch 13 , (22) 

for some constant C which depends only on the kernel K. Choosing h such that Ch 13 < e, on the event £h,a 
the triangle inequality yields \\p h — p||oo < 2e. Thus, for each e < | and each h such that Ch 13 < e, on £ h ^ e , 
instead of (14), one obtains 

P (l h (X)AL(X)) <C^e\ 

Then, setting h n = (log n/n) 2 ?+ d and e„ = f2((logn/n)) 2 ' 3 + d ), we see that R (p,ph) is of order 0((log n/n) w+<i) : 

while R (p,Ph) is of order O ((log n/n) 2 > 3 + d ). These, are, up to an extra logarithmic factor, the minimax 
rates established by Rigollet and Vert (2006). In fact, under these smoothness assumptions, and since the 
bias can be uniformly controlled as in (22), then, by a combination of Fubini's theorem and of a peeling argu- 
ment as in Audibcrt and Tsybakov (2007) and Rigollet and Vert (2006), the exponential term O ^ e - c i<nh d e- s j 
becomes redundant and rates without the logarithmic term are possible. 

4 Choosing the Bandwidth 

In this section we discuss two data-driven method for choosing the bandwidth that adapts to the unknown 
parameters 7 and 6. Before we explain the details, we point out that Li cross-validation is not appropriate 
for this problem. In fact, we are allowing for the case where P may have atoms, in which case it is well 
known that cross-validation chooses h = 0. 

4.1 Excess Mass 

We propose choosing h by splitting the data and maximizing an empirical estimate of the excess mass 
functional. Polonik (1995) used this approach to choose a level set from among a fixed class C of level sets 
of finite VC dimension. Here, we are choosing a bandwidth, or, in other words, we are choosing a level set 
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1. Split the data into two halves which we denote by X = (X±, . . . ,X n ) and Z = 

(Zi, . . . , Z n ). 

2. Let Ti be a finite set of bandwidths. Using X, construct kernel density estimators 
{p h ■ h € TL}. Let L h = {x : Ph(x) > A}. 

3. Using Z, estimate the excess mass functional 

1 " 

£(h) = ~Y,I(Zi € L h ) - \fi(L h ). 
»=i 

4. Let h be the maximizer of £(h) and set L = Lr. 

Table 1: Selecting the bandwidth using the excess mass risk. 



from a random class of level sets C = {{x: Ph(x) > A} : h > 0} depending on the observed sample X. The 
steps are in Table 1. 

To implement the method, we need to compute fi(Lh). In practice fi(Lh) can be approximated by 

1 I(p h (Uj) > A) 

M jr[ g{Ui) 

where U±, . . . , Um is a sample from a convenient density g. In particular, one can choose g — pjj for some 
large bandwidth H . Choosing M w n 2 ensures that the extra error of this importance sampling estimator 
is 0(l/n) which is negligible. We ignore this error in what follows. 

Technically, the method only applies for A > 0, at least in terms of the theory that we derive. In practice, 
it can be used for A = 0. In this case, £(h) becomes 1 when h is large. We then take h to be the smallest h 
for which £(h) = 1. 

Below we use the notation £x{') instead of £ (•) to indicate that the excess mass functional (7) is evaluated 
at a random set depending on the training set X and, therefore, is itself random. Accordingly, with some 
abuse of notation, for any h > 0, we will write £x(h) — £{Lh), with Lh the A- level set oipt- Below Ti is a 
countable dense subset of [0, h). The next result is closely related to Theorem 7.1 of Gyorfi et al. (2002). 

Theorem 4.1. Let h* — aima,yih^u£x{h) . For any 5 > 0, 

E(£x(K)) - E(£x(h)) < d(6, k) 1 + log2 (23) 

n 

where the expectation is with respect to the joint distribution of the training and test set, d(8,K) = -6(1 + 
<5)(l6 7 2 + (5(7+ 167 2 )) ; with k = 2 + \[i(S + B(0,h)) and 7 2 = |(e 4 / 7 - l). 

Now we construct a grid TL n of size depending on n that is guaranteed to ensure that optimizing over 
TL n implies we are adapting over 7 and 9. 

Theorem 4.2. Suppose (CI) and (C2) hold. Let 

eld 

t 1 



where a n = (log n/n) and 



A n (0) = 2|1Oga " K 6 



(29 + dy 
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Let G n {9) — {71(0), ■ ■ ■ , 7;v(0)(0)} where Jj(0) = (j — l)S n (9) and N(9) is the smallest integer less than or 
equal to T n (9)/5 n (6), 

T (0) 2 ° 2 26 1 

~ ^PWn ~ It 

and 

log 2 



log n — log log n 
Let 

n n = {h n (^,e) : 0e{l,...,d}, 7 eG„(0)} 

where h n (p/,ff) = a^ +1 ^^ 2e+d ^' 1+1 ^\ L e t L fr e obtained by minimizing £{h) for h S 7i„. Then 



£(L)-E(£(L)) < O 



log n 



8(t+1) 
2£) + d( 7 + l) 



The latter theorem shows that our cross-validation methods gives a completely data-driven method for 
choosing the bandwidth that preserves the rate. Notice, in particular, that adapting to the parameter 9 
is equivalent to adapting to the unknown dimension of the support of P. This makes it possible to use 
our method in practical problems as long as the sample size is large. For small sample sizes, data splitting 
might lead to highly variable results in which case our bandwidth selection method might not work well. An 
alternative is to split the data many times and combines the estimates over multiple splits. 

When fi(L) = 0, we have that h# = 0. The above theorems are still valid in this case. Thus the case 
where P is atomic is included while it is ruled out for L2 cross-validation. 



4.2 Stability 

Another method for selecting the bandwidth is to choose the value for h that produces stable clusters, in a 
sense defined below. The use of stability has gained much popularity in clustering; see Ben-Hur et al. (2002) 
and Lange et al. (2004) for example. In the context of k-means clustering and related methods, Ben-David 
et al. (2006) showed that minimizing instability leads to poor clustering. Here we investigate the use of 
stability for density clustering. 

Suppose, for simplicity, that the sample size is a multiple of 3. That is, the sample size is 3n say. Now 
randomly split the data into three vectors of size n, denoted by X = (Xi, . . . , X n ), Y — (Yi, . . . , Y n ) and 
Z = (Zi, . . . , Z n ). (In practice, we split the data into three approximately equal subsets.) 

We define the instability function as the random function 3 : [0, 00) 1— » [0, 1] given by 

S(h) = p(p h ,q h ,P z )= f dP z {x), (24) 

J{x: p h (x)>\}A{x: q h (x)>\} 

where ph is constructed from X, qh is constructed from Y and Pz is the empirical distribution based on Z. 

Rather than studying stability in generality, we consider a special case involving the following extra 
conditions. 

1. Sharp Clusters. Assume that P — TlT—i ^jPj where TTj — 1, and Pj is uniform on the compact set 
Sj of full dimension d. Thus, p(z) = Ajl(z £ Sj) where Aj = ttj/ p,(Sj). Let A = mim,- Aj > and 
let A = maxj A j . 

2. Spherical Kernel. We use a spherical kernel so that 

= J_ y I(\\z-Xi\\<h) = P{B{ Xl h)) 

nh d ^ Vd h d Vd 

i—l 

where Vd = n d > 2 /T(d/2 + 1) denotes the volume of the unit ball and P is the empirical measure. 
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3. The support of P is a standard set. Letting S — U^L-^Sj, we assume that there exists a 8 £ (0, 1) such 
that 

(J,(B(z, h) n L) > 5^{B{z, h)) for all z <E S, and all h < diam(S), 

where diam(5) = sup^ y ^ c s \\ x ~ 2/11 indicates the diameter of the set S. This property appears in a 
natural way in set estimation problems; see, e.g., Cuevas and Fraiman (1997). 

4. Choice of A. We take A = 0, so that L = S. 

Under these settings, the graph is typically unimodal with 5(0) = S(oo) = 0. Hence, it makes no 
sense to minimize 5. Instead, we will fix a constant a € (0, 1) and choose 

h = inf\h: sup H(i) < a \. (25) 

[ t>h J 

Theorem 4.3. Let h* = diam(L). Under conditions 1-4, 
1. 5(0) = and E(h) = 0, for all h > K; 
2- ™Po<h<k, E(S(/i)) < 1/2; 
3. Ash^ 0, E(E(h)) x 
^. /or eac/i /i G (0, ft.*), 

£> 3 (/i* - /i) d ( n+1 >£>J < E(S(/i)) < 21>i(/i* - h) n+1 D™ 

where 

D l = WT^TTTT^ TV, #2 



2 d r((d/2) + i)' z r((d/2) + i) 



r((d/2) + i)' * r(d/2 + i)" 

To see the implication of Theorem 4.3, we proceed as follows. Consider a grid of values Ti C (0,/i*) of 
cardinality rcr, for some < /3 < 1. By Hoeffding's inequality, with probability at least 1 — A, we have that 

hen V n 

Replacing E(3(/i)) by 5(/i) + w„ and 5(/i) — iy„ in the upper and lower bounds of part ^. of Theorem 4.3, 
respectively, setting them both equal to a and then finally solving for h, we conclude that the selected h is 
upper bounded by 



» 1/(1+1) 



and lower bounded by 



2L>i 



a + id„ 



'2 



l/(d(n+l)) 



n d(n+l) 

^4 



with probability larger than 1 — A. Thus, as n — > oo, the resulting bandwidth does not tend to 0. Hence, the 
stability based method leads to bandwidths that are quite different than the method in the previous section. 
Our explanation for this finding is that the stability criterion is essentially aimed at reducing the variability 
of the clustering solution, but it is virtually unaffected by the bias caused by large bandwidths. 
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In the analysis above we assumed for simplicity that A — 0. When A > 0, the instability S(/i) can have 
some large peaks for very large h. This occurs when h is large enough so that some mode of Ph{x) is close 
to A. Choosing h according to (25) will then lead to serious oversmoothing. Instead, we can choose h as 
follows. Let ho = argmax^S(/i) and define 

ft = infj/i: h>h , E(h) < a\. (26) 

We will revisit this issue in Section 6. A theoretical analysis of this modified procedure is tedious and, in 
the interest of space, we shall not pursue it here. 



5 Approximating the Clusters 

Lemma 2.2 shows that, under mild, conditions and when the sample size is large enough, N(X) = Nh(X) 
uniformly over h G (0, h) with high probability. However, computing the number of connected components 
of Lh{\) exactly is computationally difficult, especially if d is large. In this section we study a graph-based 
algorithm for finding the connected components of Lh and for estimating the number of A-clusters iV(A) that 
is based on the p-nearest neighborhood graph of {Xi\ ph(Xi) > A} that is fast and easy to implement. 

The idea using the union of balls of radius p centered at the sample points to recover certain properties 
of the support of a probability distribution is well understood. For instance, Devroye and Wise (1980) and 
Korostelev and Tsybakov (1993) use it as a simple yet effective estimator of the support, while Niyogi et al. 
(2008) show how it can be utilized for identifying certain homology features of the support. 

In particular, Cuevas et al. (2000) and Biau et al. (2007) propose to combine a kernel density estimation 
with a single-linkage graph algorithm to estimate the number of A-clusters (see also Jang and Henry, 2007, 
for an application to large databases). Our results offer similar guarantees but hold under more general 
settings. 

The algorithm proceeds as follows. For some h g (0, ft.) and a given A > 0, 

1. Compute the kernel density estimate pu\ 

2. compute the p-nearest neighborhood graph of {AQ : ph(Xi) > A}, that is the graph Gh,n on {Xi : ph(Xi) > 
A} where there is an edge between any two nodes if and only if they both belong to a ball of radius p; 

3. compute the connected components of Q% n using a depth-first search. 

The computational complexity of the last step is linear in the number of nodes and the number of edges of 
Qh,n (see, e.g., Cormen et al., 2002), which are both random. 

We will show that, if p is chosen appropriately, then, with high probability asm oo, 

1. the number of connected components of Gh,n, N^(X), matches the number of true clusters, N(X) = k; 

2. there exists a permutation of {1, . . . , k} such that, for each j and f , 

C*C |J B(x,p) and j |J B(x,p))nl |J B{x,p)\=$, (27) 

where C\ 1 . . . ,Ck are the connected components of Gh,n- 

We will assume the following regularity condition on the densities p^, which is satisfied if the kernel K 
is of class C 1 and P is not flat in a neighborhood of A: 

(G) There exist constants ei > and C g > such that for each h € (0, h), p^ is of class C 1 on {x : \ph(x) — 
A| < ei} and 

inf_ inf \\V Ph (x)\\>C g . (28) 
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Let Sh = min^j ^ x ec h .yec h \\ x ~ u\\ an d s °t $ = ^he(o h) Notice that, under (C2) (b), 6 > 0. Finally, 
let Oh, n denote the event in equation (27), which clearly implies the event {N^(X} = k}. 

Theorem 5.1. Assume conditions (G) and (C2) and let d* = dim(L). Assume further that there exists a 
constant C such that, for every r < 5/2 and for P -almost all x G S n L, 

P(B(x,r)) >Cr d \ (29) 

where di = dim(iSi), with x G Si. Then, there exists positive constants p and M, depending on d* and L 
such that, for every p < min{(5/2, p}, there exists a number e(p) such that, for any e < r](p), 

P(Ol n )<F(£l e )+I7p- d *e-C n o d \ 

uniformly in h G (0, h). 

The previous result deserves few comments. First, the constants ~p, M and C depend on d* . Secondly, 
assumption (29) is a natural generalization to lower dimensional sets of the standardness assumption used, 
for example, in Cucvas and Fraiman (1997). It is clearly true for components Pi of full-dimensional support 
that are absolutely continuous with respect to the Lebesgue measure. Finally, in view of Lemma 9.1 (and, 
specifically, of the way e(p,r) is defined), Theorem 5.1 holds for sequences {e n }, {h n } and {p n } such that 

1. e„ = o(l), 

2. sup„ h n < h; 

3. sup„p„ < mm{5/2,p d } and e n = o(p n ). 

In particular, if h n = o(l), then, following Proposition 3.2, the term P(££ e ) vanishes if j ^}]d | - * °°- 
Interestingly enough, condition (CI) does not play a direct role in Theorem 5.1. 

We now consider a bootstrap extension of the previous algorithm, as suggested in Cuevas et al. (2000) . 
For any h, let X* = (XI, . . . ,X'^) denote a bootstrap sample from ph conditionally on {ph > A} and let 
Q* n h denote the p-neighborhood graph with node set X* . Finally, let 0* h n be the event given in equation 
(27), except that C\, . . . ,Ck are now the connected components of Q\\ n . 

Theorem 5.2. Assume conditions (C2) and (G). Suppose that there exist positive constants C and p such 
that 

inf / p h dp>Cp d . (30) 

he(Q,h) J A h C\L h (\) 

for any ball Ah of radius p <p and center in L^(A). Then, for any p < min{<5/2,p} ; there exists a positive 
number e(p) such that, for each e < e(p), 

V ((Ot n r) <V(£ c h J + Mp- d e- CN " d , 

uniformly in h G (0, h), where M and C are positive constants independent of h and p. 

The constants C, C, p and M depend on both d and S © B(0, h). In our settings, condition (30) clearly 
holds if P has full-dimensional support. More generally, we shown the Appendix B that condition (G) and 
(29) imply (30). Just like with Theorem 5.1, using Lemma 9.1, it can be verified that the theorem holds 
if one consider sequences of parameters depending on the sample size such that e n = o(l), e„ = o(p n ), 
sup„ p n < max{<5/2, p} and sup„ h n < h, provided that the conditions of Proposition 3.2 are met. 

Despite the similar form for the error bounds of Theorems 5.1 and 5.2, there are some marked differences. 
In fact, in Theorem 5.1 the performance of the algorithm depends directly on the sample size n and, in 
particular, on the actual dimension d* < d of the support of P, with smaller values of d* yielding better 
guarantees. In contrast, besides n, the performance of the algorithm based on the bootstrap sample depends 
on the ambient dimension d, regardless of d* , and on the bootstrap sample size N. By choosing N very 
large, the expression P(££ J becomes the leading term in the upper bound of the probability of the event 
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6 Examples 



In this section we consider a few examples to illustrate the methods. 

6.1 A One Dimensional Example 

In Section 4.2 we pointed out that when A > and large, it is safer to use the modified rule h = mi{h : h > 
ho, < ot\ where ho = argmax h S(/i), in place of the original rule h = inf{h : sup (>/l S(t) < a}. We 

illustrate this with a simple one dimensional example. 

Figure 3 shows an example based on n — 200 points from the density p that is uniform on [0, 1] U [5, 6]. 
When A = (top) the original rule works fine. (We use a = 0.05.) The selected bandwidth is small leading 
to the very wiggly density estimator in the top right plot. However, this estimator correctly estimates the 
level set and the clusters. In the bottom we have A = .3. When h is large, there is a blip in the instability 
curve corresponding to the fact that the modes of Ph(x) are close to A. The original rule corresponds to the 
second vertical line in the bottom left plot. The resulting density estimator shown in the bottom right plot 
is oversmoothed and leads to no points being in the set ph > A. The modified rule corresponds to the first 
vertical line in the bottom left plot. This bandwidth works fine. 

Figure 4 compares the instability method (top) with the excess mass method (bottom). Both methods 
recover the level set and the clusters. We took A = .3 in both cases. Because A is very large, the excess mass 
becomes undefined for large h since Ph{x) < A for all x, which we denoted by setting the risk to in the 
bottom left plot. 

6.2 Fuzzy Stick With Spiral 

Figure 5 shows data from a fuzzy stick with a spiral. The stick has noise while the spiral is supported on 
a lower dimensional curve. Figure 6 shows the clusterings from the instability method and the excess risk 
method with A = 0. Both recover the clusters perfectly. Note that the excess risk is necessarily equal to 1 
for large h. In this case we take h to be the smallest h of all bandwidths that maximize the excess mass. 
We see that both methods recover the clusters. 

6.3 Two Moons 

This is a 20 dimensional example. The data lie on two half- moons embedded in M 20 . The results are shown 
in Figure 7. Only the first two coordinates of the data are plotted. Again we see that both methods recover 
the clusters. 

7 Discussion 

As is common in density clustering, we have assumed a fixed, given value of A. In practice, we recommend 
that the results should be computed for a range of values of A (see, e.g., Stuetzle and Nugent, 2009, and 
references therein). It is important to choose a different bandwidth for each A. Indeed, inspection of the 
proof of Theorem 3.5 shows that the optimal bandwidth is a function of A and that h(X) — *■ as A increases. 
Further research on data-dependent methods to choose A and p (the parameter used in the graph-based 
algorithm of Section 5) would be very useful. 

We discussed the idea of using stability to choose a bandwidth. We saw that the behavior of the selected 
bandwidth is quite different than with the excess mass method. This method seems to work well for density 
clustering unlike what happens for fc-means clustering (Ben-David ct al., 2006). We believe that the stability 
method deserves more scrutiny. In particular, it would be helpful to understand the behavior of the stability 
measure under more general conditions. Also, the detailed theoretical properties of the modified method for 
selecting h based on stability should be explored. 
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Figure 3: The left plots show the instability as a function of log bandwidth. The horizontal line shows 
a = 0.05. The right plots show the true density and the kernel density estimator based on the selected 
bandwidth h. In the top plots, A = 0. In the bottom plots, A = .3. 
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Figure 4: The top left plot shows the instability as a function of log bandwidth. The top right plot shows 
the true density and the kernel density estimator based on the selected bandwidth h using the modified rule. 
The bottom left plot shows the estimated excess mass risk as a function of log bandwidth. The top right 
plot shows the true density and the kernel density estimator based on the selected bandwidth h obtained 
by maximizing the excess mass. In both bottom plots, A = .3. Both methods recover the level set and the 
clusters. 
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Figure 5: 500 data points from a fuzzy stick plus a spiral. 
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Figure 6: Clusters obtained from instability (top) and excess mass (bottom). 
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Figure 7: Clusters obtained from instability (top) and excess mass (bottom). The data are in K 20 but only 
the first two components are plotted. 



24 



Finally, we note that there is growing interest in spectral clustering methods (von Luxburg (2007)). We 
believe there are connections between the work reported here and spectral methods. 
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9 Proofs 

Proof of Lemma 2.1. The weak convergence follows from the fact that P is a Radon measure (sec, e.g., Lconi 
and Fonseca, 2007, Theorem 2.79). As for the second part, if x S Si, where Si has Hausdorff dimension 
d, then p[x) = KiPi(x), with Pi a Lebesgue density, and the result follows directly from Lconi and Fonseca 
(2007, Theorem 2.73, part (it)). See also Appendix A. On the other hand if di < d, then it is necessary to 
modify the arguments as follows. Since K is smooth and supported on 5(0,1), there exists a rj such that 

K (\^) > rj if \\x — y\\ < r\h. Set C — j —^ d — S where v di is the volume of the unit Euclidean ball in R di . 
Then, 

Phi') = ^I Si nB {x , h) K( x -^)dP(y) 

- ! Si nB(x,r)h) dP(y) 

= Vh'-ff va-(vh) d * P *( B ( X ' q*)) 
= C Pi(B(x,r,h)) 

As as h — > 0, Pi ^ d B ^^dP —* Pi{%) < oo, by (39) almost everywhere Ti di , while h{d C di) — > oo, thus showing 
that Yan.h^,QPh{x) = oo. ■ 

Proof of Lemma 2.2. By assumptions (C2), for any < e < e and < h < h, 

N h (X - e) = N h (X) = N h (X + e) = N(X) = k. 

On the event Sh,e it holds that 

£ft(A + \\ph - Ph\\oo) C L h (X) C L h (X - \\ph - Ph\\oo), 
which implies that, on the same event, 

k = N h (\+\\p h -p h \\ 0O ) < N h (X) < N h (X — \\p h — ph\\oo) = k. 



Proof of Lemma 2.3. Recall that Kh is supported on B(0, h). For the first claim, it is enough to show that, 
for any e e [0,e), L h (X-e)-L(X-e) C dL(X-e)+B(0,h). Indeed, by (C3), fj, (dL(X - e) © B(0, h)) < C 2 h e , 
which implies (10). Thus, we will prove that, if w £ dL(X — e) © B(0, h), then w £ Lh(X — e) — L(X — e). For 
such a point w, either p(w) > A — e or, by conditions (C2), p(z) < X — e for every z € B(w,h). Since the 
kernel K has compact support, the latter case implies that Ph{w) < A — e as well. Therefore, 

w 6 {x : p(x) > X — e} U {x: ph{x) < X — e} 
= {x: p(x) < A - e,ph(x) > A - e} c 
= (L h (X-e)-L(X-e)) c . 
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As for inequality (11), it is enough to observe that the set 



I ht6 = (L h (X-e)-L(X-e))nS 

either has zero probability (because it is empty or has Lebesgue measure 0) or has positive Lebesgue measure. 
In the former case we obtain £ = oo. In the latter case, Ih, e must be full-dimensional, so that, by (10), 
M(^i.e) ^ C^h, for all h € (0, h). Since p is bounded by A on Ih,n, we obtain 

P(L h (X - e) - L(X - e)) = P{I h , e ) < ^C 2 h = C 3 h, 

which implies that we can take £ = 1. B 

Proof of Lemma 2.4- Since p is Lipschitz and integrable, p _1 (A) is W ^measurable, so the integral H d_1 ({x: i 
A}) is well defined for A S (0, |jp||oo), where 7i d_1 denote the (d — l)-dimensional Hausdorff measure in M. d . 
Furthermore, we can use the coarea formula. See Evans and Gariepy (1992) and Ambrosio et al. (2000) for 
backgrounds on Hausdorff measures and the coarea formula. By the Rademacher Theorem, the set Ei of 
points where p is not differentiable has Lebesgue measure zero. By Lemma 2.96 in Ambrosio et al. (2000), 
the set E 2 = {x: ||Vp(x)|| = 0} is such that H d - 1 {p- 1 (X) H E 2 } = 0, for all A € (0, ||jp||oo) outside of a set 
E$ C K of Lebesgue measure 0. Without loss of generality, below we may assume that E\ and E 2 are empty. 
Thus, we can assume that, for any A G (0, ||p||oo) H E%, there exists positive numbers e, C and M such that 

(i) mf xe{x . | p ( x )-A|<e} ||Yp(sf)|| > C, almost everywhere-/!; 

(ii) sup^^n^dx: p(x) = A + ry}) < M. 

Then, for each e £ (0,e), 

P({x: \p(x) -X\< e}) = J p(x)l { \ p ^ x \<e}dfJ,(x) 

= I \\Vp(l)\\ 1 {\ P (x)-x\<c}\\^P(x)\\d^(x) 

= c:h^ + u )} Mndn--\x)du 

= f+:(X + u) f ( (llVpix)]])- 1 dH n -Hx)du 

C £ ' 

where the second equality holds because ||Vp(x)|| is bounded away from on {x: \p(x) — A| < e} by (i), the 
third equality is a direct application of the coarea formula (see, e.g., Proposition 3 page 118 in Evans and 
Gariepy, 1992) and the last inequality follows from (i) and (ii). ■ 

Proof of Corollary 2.5. Following the proof of Lemma 2.4 and using our additional assumption that p is of 
class C , without any loss of generality, below we can assume that the set E\ and E 2 are empty and we recall 
that E3 has Lebesgue measure 0. Let A E3 be such that 

inf ||Vp(x)|| >0. 

xep- 1 (x) 

We now claim that there exists a non-empty neighborhood U of A for which 

inf inf ||Vp(x)|| > 0. 
\eUxep- 1 (x) 

Indeed, arguing by contradiction, suppose that the previous display were not verified for any neighborhood 
U of A. Then, there exist sequences {A„} C R and {x„} C S such that lim„ A„ = A, and x„ S p~ 1 (X n ) and 
Vp(x„) = for each n. By compactness, it is possible to extract a subsequence {i„J of {x„} such that 
x nk — * x, for some x G p~ (A). Since p is of class C 1 , this implies that ~S/p(x nk ) — » Vp(x) as well. However, 
\Ip(x nk ) = for each k by construction, while Vp(x) ^ 0. This produces a contradiction. Thus, for each A 
that is not a critical point, one can find a neighborhood of positive length containing it and, by Lemma 2.4, 
CI holds at A with 7=1. Since, using compactness again, ||p||oo < 00, this implies that there can only be a 
finite number of critical points for which 7 may differ from 1. ■ 
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Proof of Lemma 3.1. Since e < e and h < h, in virtue of (C2) (b) it holds that, on £h,e, 

L h (X)DL h (X + e)DL(X + e), 

and 

L h (X) C L h (X - e) = L(A - e) U (L h (A - e) - L(X - e)) . 
Because L(A + e) C L(A) C L(X — e), the above inclusions imply, still on £h,e, that 

L h (X)AL(X) C (L(A-e)-i(A+e))U(L h (A-e)-L(A-e)) 
= iUfl, 

where it is clear that the sets A and B are disjoint. Taking expectation with respect to P of the indicators 
of the sets L/ l (A)AX(A), A and -B and using condition (CI) and Lemma 2.3 yield (14). ■ 

Proof of Proposition 3.2. The claimed results are a direct consequence of Corollary 2.2 in Gine and Guillou 
(2002). We outline the details below. We rewrite the left hand side of (17) as 



£/(X0-E[/(X0] 




where 



T h ={K[ — ),x& 



and then proceed to apply Gine and Guillou (2002, Corollary 2.2). Following their notation, we set t = nh d e 
and. since, 



sup Var[/] < sup / K 2 [ J dP(x) < h d D, 



fe^h 

we can further take a 2 — h d D and U — C\\K \\oo, where C is a positive constant, depending on h, such that 
a < U/2. Then, conditions (2.4) (2.5) and (2.6) of Gine and Guillou (2002) are satisfied for all n bigger than 
some finite n , which depends on the VC characteristics of K, D, \\KWao, C and e. Part 2 is proved in a 
very similar way. In this case, we take the supremum over the the entire class T and we set <r^ = h^D and 
U = \\K\loo. For all n large enough, condition (2.5) is trivially satisfied because h n — o(l), while equations 
(2.4) and (2.6) hold true by virtue of (18). The unspecified constants again depend on the VC characteristics 
of K, D, and ||if||oo- ■ 

Proof of Theorem 3.3. We can write 

E(p(p,p h ,P))=E\ [ dP;£ h A+E[ f dP- t £l e ), (31) 

\JL h (\)AL(\) ) \JL h (X)AL(X) ) 

where for a random variable X defined on some probability space (f2, P) and an event £ C T ', E(X; £) = 
InnA j ^( w )^lP( Ci; )- Using Proposition 3.2, the second term on the right hand side is upper bounded by 

P(£h, e ) < Le~ nCKhde \ (32) 

As for the first term on the right hand side of (31), without loss of generality, we consider separately the 
case in which the support of P has no lower-dimensional components and the case in which it of lower 
dimension. The result for the cases in which the support has components of different dimensions follows in 
a straightforward way. 

If the support of P consists of full-dimensional sets, then, on the event £h,<n 

h h (x)AL W dp < P(L(X-e)-L(X + e)) + P(L h (X-e)-L(X-e)) 
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where the first inequality stems from (14) and the second from conditions (CI) and (11). 

If instead P has lower dimensional support, then, because, on the event £h,t, Lh C Lh(X — e) and because 
L C Lh(X — e) by (C2) (b), we see that, on 

dP = 0. 

Lh (A)AL(A) 

We conclude that ¥.(p(p,ph, P); £h,e) is bounded by max{Ci, C3}(e 7 + h^) if the support of P contains a full 
dimensional set and is otherwise. This, combined with (32), yields the claimed upper bound on the level 
set risk with Cl — max{Ci, C3, L}. The convergence rates are established using simple algebra. Notice that 
the choice of the sequences {e„} and {h n } does not violate condition (18). ■ 

Proof of Corollary 3.4- For each i € {1, . . . , n}, 

P (* G T h \£h,e) <P(l,'6 L h AL\e h , e ) < maxIC^CaKe 7 + ^)p^j 
where the last inequality is due to Lemma 3.1. Thus, 



E(|4|) < EtiP^G/l^eJP^ + nP^) 

< n(max{Ci,C 3 }(ei + hi) 

< C L (e-y + h* + e -OKnh^\ . 

Proof of Theorem 3.5. From equation (9), we have 

£{L)-£{L h ) = f |po-A|d/i + Pi(L)-Pi(L fe ), 

JL h AL 

where po — ^jjj-. Since, on the event £h,e, Lh D L^(A + e), we obtain, on the same event, 

Pi(L) - Pi(L h ) < P 1 (L) - Px(L h + e) = 0, 
where the last equality is due to condition (C2) (b). Therefore, 



£{L) - £{L h ) < f \ Po - X\dp, 

JL h AL 



Just like in the proof of Theorem 3.3, we treat separately the case in which the support of P is of lower- 
dimension and the case in which it consists of full-dimensional sets. If the support of P is not of full 
dimension, then, on £/> e , 

£{L) - £{L h ) < X^(L h AL) < Xfi(L h (X - e) - L(X - e)) < XC 2 h e , 

by (10). On the other hand, if the support of P has no lower-dimensional components (so that po = p), still 
on the event £h,e and using Lemma 3.1, 

/ \p-X\dfj,< [ \p-X\dn+ [ \p-X\d/j,. (33) 

J{L h (\)AL(X)} J L(\-e)-L(\+e) J L h (\-e)- L(A-e) 

The first term on the right hand side of the previous inequality can be bounded as follows. 

J*L(A-e)-L(A+e) \P ~ MM*) = S{x: \p(x)-X\<e} \P ~ MM*) 

^ e I{ X : \ P (x)-X\<e}Mx) 
= jhl{x: \p(x)-\\<e}( X ~ £ ) d l i 

< C^- e 7+l_ 
— A— e 
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where the last inequality is due to condition (CI). As for the second term of the right hand side of (33), 

\p - X\dfi < X/j, (L h (X ~e)~ L(X - e)) < XC 2 h e , 

L h (X-e)-L(X-e) 
by (10). 

Thus, we conclude that E \£(L) — £(Lh); £ h,f ^j is bounded by XCih e if the support of P is a lower 
dimensional set and by 

max j A C 2 ,^j(^+ 1 + / l e ) 

otherwise. Next, by compactness of S, and using (32), 

E(s(L)-£(L h );£l e ) < (l + A M (S + 5(0, h))) F(£ h>e ) < C S {1 + X)L e - nC « hde \ 

for some positive constant Cs, uniformly in h < h. The claimed upper bound on the excess mass risk now 
follows by taking Cm = max |AC2, Cg(l + A)l|. The convergence rates can be easily obtained by 
simple algebra. Notice that the choice of the sequences {e n } and {h n } does not violate condition (18). ■ 

Proof of Theorem 4-1- This follows by combining the version of Talagrand's inequality for empirical processes 
as given in Massart (2007) with an adaptation of the arguments used in the proof of Theorem 7.1 in Gyorfi 
ct al. (2002). For completeness, we provide the details. 
Define h = argsup /lgW £(L^), where 



1 ™ 

8(L h ) = -J2l(Z i £L h )-Xv(L h ). 



n 

i=i 

and /i* = a,rgsup h&n £x (Lh) ■ Set T(h) = £x{Lh,) — £x{Lh), where h e TL. Recall that both Lh* and 
i/j = {x: ph > A}, are random sets depending on the training set X. We will bound E(r(ft)), where the 
expectation is over the joint distribution of X and Y. 
We can write 

E(T(h)\X) = E(T(h)\X)-(l + S)f(h) + (1 + S)f(h) 

Ti T 2 

where f(h) = £(L h J - £(L h ). Note that 

f(h) = £{L Ti ) - £{L K ) < £{L K ) - £{L K ) = 0. 

Thus, E(T 2 |X) < 0. We conclude that 

E(T(h)) - E(E(T$)\X)) = E(E(Ti|X)) +E(E(T 2 |X)) < E(E(Ti|X)). (34) 

Now we bound E(Ti\X). Consider the empirical process 

Z = sup r(/i), 
hen 

so that Z = f(h) and E(T(h)\X) = E(Z\X). We have 



F(Ti>s\X) = F\E(Z\X)-(l + S)Z)>s 



X 



E(Z\X)-Z> S -±f^ 



X 
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Notice that, conditionally on X, Z = ^ sup he7i Y^i=i fh(Yi), where, for each S H, / k : R d n M is the 
function given by 

f h {x) = I(x e L h J - \fi(L h J - (I(x e L h ) - Xn{L h )) . 

with \\fh\\oo < «. Let a 2 = E(A sup hen £" =1 f%(Yi)\X) and notice that cr 2 < reE(sup h f(h) \X) = kE(Z\X). 
Thus, 

s + Sa 2 /re 



P(Ti > s|X) < P E(Z|X) - Z > 
which, by Corollary 13 in Massart (2007), is upper bounded by 

2exp 



4(4 7 2 cr 2 + |ree) 

Then, some algebra (see Problem 7.1 in Gyorfi ct al., 2002) yields the final bound 

P(Ti > s\X) < 2exp- 

where d(8, re) is given the in the statement of the theorem. 
Set u = log 2. Then. 



d(5, k) J 



E(Ti|X) = / P(Ti > s|X)ds < u+ / P(Ti>s|Jf)ds 



2d(<5, re) f nu 



u H exp 

d(S, re) 



n |_ <i(o~, re) 

1 +log2 

n 



From (34) we conclude that 



and so 



which implies that 



E(T(h)) < d(5, re) 1 + 1 ° g2 . 



E(M(h)) <E(M(h*)) + d(5, re) 1+1 ° g2 



E{£ (h)) > E(£(h,)) - d(6, re) 1+1 ° g2 . 



This shows (23). 

fl(-y+l) 

Proof of Theorem 4.2. Define r„(7, 9) = ( -^-^ ] . For each 9, r n (j, 9) is decreasing in 9 and 

r„(T„(0),0)<2r„(oo,#). 

Hence, inf 76 [ x„(9)] r ri(7> — 2inf 7 > r(j, 9). Some algebra shows that \dr n (j, 9)/dj\ < A n {9) for all 7 and 
9. Therefore, 'for each 3, r n ( l3 {9),9) = r n (jd n (9) + 5 n (6),0) > r n (jS n (9),9) - S n (9)A n (9) > r„( 7j (0), 0)/2. 

Let h n = h(i,Q). By Theorem 3.5, R M (p,p h J = O ( (log n/n) sww 1 ! . Let h* € H„ minimize R M (p,p h ) for 
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ft e H n . Then, R M ( P ,p h ») < 2R M (p,p hn ). So, 

R M ( Pl p~ h ) < d(S,K)^^+R M (p,p, i 
< d(S,K) - ' +2R M (p,p h J 



1 - 


flog 


;2 




?; 




1 - 


flog 


;2 




n 




1 - 


flog 


;2 



= rf(<5,K) -^+2r„( 7 ,0) 

= O 



8(7 + 1) 

log n\ 2»+<*(7+i) 



Proof o/ Theorem 4.3. (1) When ft, = 0, {p h > X} = X and [q h > A} = Y so that {p h > A}A{$X > A} = 
(X, Y). Since P has a Lebesgue density, with probability one, dPz puts no mass on (X,Y) and, therefore, 
3(0) = 0. By compactness of S, if h > diam(S'), then llp/Jloo = ||<7/j||oo = 1 1 i with the supremum attained 

ft 

by any z £ S. Thus, as ft — > oo, ||J5/i — §/j||oc - * and consequently, H(oo) — > 0. 
(2) Note that 

5(ft) = p(p h ,q h ,Pz)= I dPz(z) 

J{p h >X}A{q h >\} 



I(Ph(z) > X,q h (z) < X)dP z (z) + I I{p h {z) < X,q h (z) > X)dP z {z). 

Define ((h) = E(E(h)\X,Y). Then, 

((h) = p(p h ,q h ,P) 

I(Ph(z) > X,q h (z) < X)dP(z) + I I(p h (z) < X,q h (z) > X)dP(z) 



i 2 / I(p h (z) > X,q h (z) < X)dP(z), 

where = denotes identity in distribution. Let Tth(z) = V(ph(z) < A) = F(qh(z) < A). By Fubini's theorem 
and independence, 

E(E(h)) = E(((h)) 

= 2 [ F(p h (z) > X,q h (z) < X)dP(z) 



2 / F(p h (z) > X)V(q h (z) < X)dP(z) 

JR d 

2 [ 7T h (z)(l-7T h (z))dP(z). (35) 



Since 7T/j(z)(l — tth(z)) < 1/4 for all n, h and z, (2) follows. 

(3) Let W = (X, Y) be the 2n-dimensional vector obtained by concatenating X and Y and define the 
event 

A h = {B(W U h) n B{Wj,h) = 0, Vi^ i} . 

Let ft be small enough such that Xnh d Vd < 1 (trivially satisfied if A = 0). Then, for any realization w of the 
vector W for which the event Ah occurs, 

2n 



I(p h (z) > A, q h (z) < X)dP(z) = P{B{w h ft)). 
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By our assumptions, 

2n 

2nSh d v d < P(B(wi, h)) < 2nKh d v d . 

i=l 

Using the union bound, we also have 

P(^)< ( 2 2 n )(2^)%A. 

Thus it follows that, for fixed n, E (£(h)) — > as h — > according to 

2nSh d v d < E < /i%2Amax {2 d n(2n - 1), 2n} . 

(4) By the same arguments used in the proof of point (1), for all h > h*, = almost everywhere with 
respect to the joint distribution of X and Y, and, therefore, E(£(/i)) = 0. Thus, we need only to consider 
the case < h < /i*. 

Set p z .h — P(B(z, h)) and denote with X z _h a random variable with distribution Bin(n,p Z) /,). Then, 

P(p h (z) = 0) = ¥{Xz,h = 0) = (1 - Pz , h ) n . 

For each z e 5, set D(z,h) — {z' e S: \\z — z'\\ < h} and Sh = {z: D(z,h) ^ S}. Furthermore, set 
Ph,max — su Pzts h {Pz,h} an d Ph,min — inf zG s h {Pz,h}- Then, the expected instability cam be written as 

E(E(h)) = 2 [ ir h (z)(l - n h (z))dP(z) 
Js h 

so that A h , n < E(S(/i)) < B h ^ n , where 

A h . n = 2P(S h )(l - Ph . max ) n (1 - (1 - Ph , mi n) n ) , 

B h . n = 2P(5fc)(l-Pfc,min) n (l-(l-Pfc,max) n ). 



We will now upper bound Bh tn /2. For the first term we proceed as follows. There exists a sphere 
E = B(zo, h*/2) such that S C E. (For example, choose any two points z, z' such that \\z — z'\\ — h* . Set 
z Q = (z + z')/2.) Let A = B(z ,h*/2) - B(z ,{h* - h)/2). We claim that S h C A. This follows since if 
z E A C DS then z G B(z ,h/2) and then sup z , eS \ \z - z'\\ =< sup zeB(z(hh/2)z , eB(za j u/2) \\z-z'\\ = h. Thus 
if z e S h then z e AnS C A. Hence 

P(5 h ) < P(A) < AKA) = A ^'^^f^^ < - *) 

where 

n d/2 h d-i 



2 d r((rf/2) + 1) ■ 

For the second term, let zo = argmin z p Zj /j. Then, 

1-Pfc,min = l-P(B(zo,h)) = P(B(z,K))-P(B(zo,h)) 

= P(B(z, K) - B(z , h)) < Afi(B(z, K) - B(z , h)) 

- A r((d/2) + i) - D ^~ h ) 

where D 2 = r((d/2)+i) • The third term is bounded above by 1. Hence, B n < D\D 2 (h* — h) n+1 . 
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Now we lower bound Ah, n /%- First we claim that Sh contains the intersection of a sphere of radius r/2 
where r = h* — h, with S. Indeed, let z E Sh- Then there exists z' £ S such that \\z — z'\\ < h* = h + r. Let 
w G B(z', r/2). By the triangle inequality, \\w - z\\ > h + r/2. So B(z', r/2) n S C S h . Therefore, 

P(S h ) > P(B(z',r/2)nS)>Ap(B(z',r/2)nS) 
> 6Afi(B(z / , r/2)) = D 3 (K - h) d 

where D 3 = r{{ j /2)+1) ■ 

To lower bound the second term, Let zq = argmax z p z .; l . Then, 

- l-P{B{z ,h))=P{B{z,K))-P{B{z ,h)) 

= P{B{z, K) - B(z , h)) > Ap>((B(z, h,) - B(z , h)) n 5") 

(h - h) d Tr d / 2 
> A6»((B(z,h*) - B(z ,h))) = ^ [ ^ d/2 > + l} 

= D 4 (K - h) d 

where D<± = r ~ d / 2 +i) • Thus, (1 — p^max)" > D%(h* — h) . For the third term, argue as above that 

1 — Ph,min < ^(/i* — ft) so the third term is larger than 1/2 when /i is close enough to /i*. Hence, 

A„ > ^D2(K - h) d ^ n+1 \ m 

Proof of Theorem 5.1. By our assumptions (see Section 2.1), 

n P(B{x,r)) 

< hm — — < oo 

where di = dim(S' i ), for any x outside of a set of Pj measure zero. By Theorem 5.7 in Mattila (1999), di is 
also the box-counting dimension of Si- Thus, d* — max^c^. Combined with (29) this implies that, without 
loss of generality, we can assume that there exist constants C > and ~p > such that for every ball B of 
radius p <p and center in L(X), P(B) > Cp d . 

Let A be a covering of L(X) with balls of radius p/2 and centers in L(X), with p < J>. By compactness of 
L, \A\ < Mp~ d , where M depends on d* and L(X) but not on p. 

Next, by Lemma 2.2, on the event £h, e = {\\Ph — Ph\\oo < e}> the set Lh consists of k disjoint connected 
sets. Since p < 8/2, this implies, on the same event, that Nf/{X) > k. Thus, on the event £h,e, for some 
e < t\ to be specified below, a sufficient condition for the event Oh, n to be verified is that every A G A 
contains at least one point from the set = {i: ph(Xi) > X} (similar arguments are used also in Cuevas 
et al., 2000; Biau et al., 2007). We conclude that the probability of 0£ n is bounded from above by 



P(£h e) + Mp~ d * sup P ({Xi <?A,Vie J h \ n £ h>e ) 



Since, on the event £h,e the set Jh n = {«: Ph n {Xi) > A + e} is contained in Jh, we further have that, for each 
A G A h , 

p({x^A,ViG J h }n£ ht€ ) < (l-P(An{ Ph >X + e})) n , (36) 
where the inequality stems from the identity among events 

{X, £ A, Vz G J h } = f) {{ Phn {Xi) > X + e} n A c } U { Phn (X t ) < X + e}} , 

i 

and the independence of the Xj's. By Lemma 9.1, for any fixed < r < 1/2, there exists a point y G 
L(X) n L h (X + e) such that B (y, ^) C AnL h (X + e), for all e < e(p,r). Thus, 



P(4nL & (A + e))>P(IJ (y,^JJ >C[ 



33 



for all e < e(p, r), where the second inequality is verified since ^ < p. Set e(p) = min{ei, e(p, r)}. The result 
now follows from collecting all the terms and the inequality (1 — x) n < e~ nx , valid for all < x < 1. ■ 

Proof of Theorem 5.2. Let Ah be a covering of Lh(X) by balls of radius p/2 and centers in L/j(A). By the 
same arguments used in the proof of the theorem 5.1, the probability of the event (0£ n ) c is bounded by 

p(£ c he ) + Jip- d sup F({x;#A,vj}ne h>t ), 

AeA h 

where the probability is over the original sample X = [X\,...,X n ) and the bootstrap sample X* = 
(X'(, . . . ,X N ). Here the value of e < t\ used in the definition of the event Eh,e is to be specified below. 
Because of compactness of the support of P, M is a constant depending only d and S + B(0, h). 

For a set S C R d , we denote with S" 8 ™ the n— fold Cartesian product of S and with Px*\x=x ^ ne 
conditional distribution of the bootstrap sample X* given X = x, with x — (x\, . . . , x n ). Let £ n — {x S 
S® n '■ \\Ph — Ph\\oo < e }, where f>h is the kernel density estimate based on x. Then, for each A £ Ah, 

F{{X* £A,Vj}nE h , e ) =E x (p x , lx ((A c )® n );£n), 

where, if X ~ P, Ex(f(X);£) = fr xe£ \ f(x)dP(x). For every x £ £ n , by the conditional independence of 
X* given X = x, 



P v ,v ((A c )® n ) - (l S ^ h w fh{ - V)dv \ 
Px*\X=A{A) ) - \l f{Ph> _ x)Phiv)dv ) 



N 



where 



V(h,e) = / ( Ph + e)dp. 



By Lemma 9.1, for any fixed r < 1/2 and each h, there exists a point ?/ £ Lh(X) H 2^ (A + e) such that 
f)cinift(A+e), for all e < e(p,r). Thus, 



/ 

J At 



(p h - e)d(x > (p h - e)dp = I p h dp -ep(B(y, 



Next, 



Tp 



V(h,e)= / p h dp + ep (L ft (max{A - e, 0})) + / Phdp. 

JL h {\) Ji h (A)-L h (max{A-£,0}) 

Following the proof of Lemma 9.1, one can verify that, because of assumption (G), inf^Q ^ p (Lh(X) — L^(max{A — e, 0})) 
0. as « • I). Thus, 

! ^f'- tW > S f^ V f (1 + «!)), 

as e — > 0. Then, using (30) and the facts r < 1/2 and J Lh ^x)P h ^l I — ^ ^ or eacn we conclude that there 
exists a e(p, r) such that 

X4nL fe (A + e)(^- e ) d M d 

for all < e < e(p, r) and for some appropriate constant C, independent of p and h. Thus, 

Px*\x= x ((AT n ) < e~ NCpd 
and the results now follows by setting e(p) = min{e 1; e(p, r)}. ■ 



34 



Lemma 9.1. Assume conditions (C2) and condition (G). Then, for any < r < 1 and p > 0, there exists 
a positive number e(p, r) such that, for all e < e(p, t), 

sup sup dist(x, Lh{\ + e)) < Tp. (37) 

he(OM) x£L(X) 

Proof. The claim follows by minor modifications of the arguments used in the Appendix of Biau et al. (2007) . 
We provide some details for completeness and refer to Lee (2003) for background. Because of assumption 
(G) and in virtue of the regular level set theorem (see, e.g. Lee, 2003, Corollary 8.10), for any e £ (0, ei) and 
h £ (0, h), the set {x: Ph(x) = A + e} is a closed embedded submanifold of R d . Let r(e, h) be the maximal 
radius of the tubular neighborhood around {x: Ph(x) = A + e}. Set fh = inf e<£l r(e,h) and notice that 
fh > is positive for each h £ (0,/i). Then, following the proof of Biau et al. (2007, Proposition A. 2), if 
e < ei, for any h £ (0, h), 

sup dist(x,L h (X + e)) < C7 1 e, (38) 

xedL h (X) 

where C g in the same constant appearing in (28) (see Equation (A.l) in Biau et al., 2007). In fact, since C g 
does not depend on h, (38) holds uniformly over h £ (0,h). Set e(p,r) = sup{e £ (0,ei): Ce < Tp}. Then, 
as L(X) C L h (X) by (C2) (b), (37) is verified for each e < e(p,r). ■ 



Appendix A: The Geometric Density 

In this section we describe in detail our assumptions on the unknown distribution P. For the sake of 
completeness, we provide the basic definitions of Hausdorff measure, Hausdorff dimension and rectifiability. 
We refer the reader to Evans and Gariepy (1992), Mattila (1999), Ambrosio et al. (2000) and Federer (1969) 
for all the relevant geometric and measure theoretic background. 

Let k £ [0, oo) . The /c-dimcnsional Hausdorff measure of a set E in R d is defined as TL k (E) = lim^o Hs (E) , 
where, for S £ (0, oo], 

H k 5 {E) = |^inf |g(diam( J Bi)) fc : diam(^) < j| 

where the infimum is over all the countable covers {Ei}i e i of E, with the convention diam(0) = 0. The 
Hausdorff dimension of a set E C W 1 is 

inf{fc>0: H k (E)=0}. 

Note that H is the counting measure, while TC d coincides with the (outer) Lebesgue measure. If k < d, we 
will refer to any Ti^-measureable set as a set of lower-dimension. When 1 < k < d is an integer, H k (E) 
coincides with the fc-dimensional area of E, if E is contained in a C 1 fc-dimensional manifold embedded in 
R d . 

The set E is said to be H fe -rectifiable if k is an integer, H k (E) < oo and there exist countably many 
Lipschitz functions fi : M. k i— > K d such that 

n k ^-p/i(R*)J . 

A Radon measure v in M. d is said to be fc-rectifiable if there exists a 7i fe -rectifiable set S and a Borel function 
f:S^R d such that 

v{A) = [ f(x)dH k (x), 
JAnS 

for each measurable set A C R d . 
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Throughout this article, wc assume that P is a finite mixture of probability measures supported on disjoint 
compact sets of possibly different integral Hausdorff dimensions. Formally, for each Borcl set A C M. d and 
for some integer m, 

rn 

P{A) = Y J ^{A), 

i-1 

where tt is a point in the interior of the (m — l)-dimensional standard simplex and each Pj is a di-rectifiable 
Radon measure with compact and connected support Si, where di € {0, 1, . . . , d} and Si H Sj = 0, for each 
i j. Notice that we also have max^ Tt di (Si) < oo. By Theorem 3.2.18 in Fcdcrcr (1969), each of the lower 
dimensional rectifiable sets comprising the support of P, can be represented as the union of C 1 embedded 
submanifolds, almost everywhere P. Thus, we are essentially allowing P to be a mixture of distributions 
supported on disjoint submanifolds of different dimensions and finite sets. 

Our assumptions imply that, for every mixture component Pj, there exists a 7i di -measurable real valued 
function pi such that such that 

, , f lim^o > iixeSi 

Pi(x) = { "p* * (39) 

{ if x £ Si, 

where v di is the volume of the unit Euclidean ball in M. di . See, for instance, Mattila (1999, Corollary 17.9) 
or Ambrosio et al. (2000, Theorem 2.83). Indeed, p t is a density function with respect to Tt di since, for any 
measurable set A, 

Pi(A)= [ Pi (x)dH di (x), 
J An Si 

where Tt di denotes the ^-dimensional Hausdorff measure on M. d . 

We do not assume any knowledge of the probability measures comprising the mixture P, of their number, 
supports and dimensions, nor of the vector of mixing probabilities w. 

Recall that the geometric density is the extended real-valued function defined as 

hio v d h d 

Below we list the key properties of the geometric density. Notice, in particular, that p is not a probability 
density with respect to /i, since, in general, < f Rd p(x)d(i(x) < 1. 

Proposition .2. The geometric density satisfies the following properties: 

(i) p{x) = oo if and only if x € Si with dim(Si) < d, almost everywhere P. 

(ii) p{x) = iriPi(x) < oo if and only if x S Si with dim(S'i) = d, almost everywhere \i. 

(Hi) n({x: p{x) = oo}) = 0. 

(iv) Ifx^S, then p(x) = 0. 



(v) S = {x:p{x) >0}. 
Proof. If Si has dimension d, then, by the Lebesgue Theorem, 

p(x) = lim P ( B ( x, h ^ = mp^x) < oo, 

/^-almost everywhere on Si. Similarly, if di < d, then, by (39), 

, , P(B(x,h)) v di h d < P t (B(x,h)) 

p(x) — hm — -, = lim — -, —j = oo, 

hio v d h d UO v d h d v d .h d * 

since -^p > oo as h J, 0, 7i di -almost everywhere on Si- Thus, part (i) and (ii) follow. Part (Hi) is a direct 

consequence of (i) and (ii), while parts (iv) and (v) stem directly from the definition of support. I 
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As a final remark, even though the geometric density p is very different from the mixture densities pi, 
for our clustering purposes, we need only to concern ourselves with estimating the level sets of p. 



Appendix B 

In this section we give sufficient conditions for (30) to hold. We focus only on the case in which L has 
dimensional smaller than d. If L is full-dimensional, then it is easy to see that (30) holds. 

Lemma .3. Assume condition (G) and (29). Then (30) is verified. 

Proof. Throughout the proof, we set d(h) = inf x6 i il , e gi h \\x — y\\. 

For any c € (0, 1) define hi(c) to be the infimum of all h < h such that for every ball A of radius p < p 
and center in L} l (X) there exists a ball B c A n Lh of radius cp. Set ft-i(c) = oo when the infimum does not 
exist. Let c* = sup{c £ (0,1): h\(c) < oo} > 0. It can be seen that, for any c < c*, h\{c) < oo and that 
hi(c) — > as c J, 0. 

Next, let A r be a ball of radius r and center in L. Then, P(dA r ) = for all r £ (0,5/2) — R, where 
R C (0,(5/2) is finite (possible empty). Thus, by Theorem 4.2 in ?, for any < U < 1 there exists a hiiU) 
such that, for every h < h 2 (U), 

W > u - m 

uniformly over the set of balls A r with centers in L and radius r S (0,(5/2) — R. Furthermore, since 
SVL PxeL,y£L h \\ x ~ v\\ ~ > as h ~ > 0' we can choose t/ and h\(U) such that ^w^-y^ > U, for all /i < hi(U), 
uniformly over the balls A r . Therefore, by choosing an appropriate c < c* and U, we may then assume that 
hi(c) < h 2 (U). Set h* = (h 2 {U) - h x (c))/2. 

Let p < p be the radius of A, a ball centered in L/j, and let D S (1,2) be fixed. We distinguish three 
cases. 

1. p > Dd(h*) and h > h* . 

In this simple case, we immediately obtain 

Ph(A n L h ) > \p(B) = \v d c d p d . (41) 



2. p> Dd(h*) and h < h* . 

By the definition of d(h*), there exists a ball of radius (D — l)p and center in L contained in An Lh. 
Thus, using (40), 

Ph(AnL h ) > P h (B) > UP(B) > UC((D - \)p) d \ (42) 
where the last inequality follows from (29). 

3. p < Dd(h*). 

First suppose that A is centered in L. If h > h* , then the ball B having the same center as A and 
radius mm{p,d(h*)} is entirely contained in L h . Thus, P h (AnL h ) > P h (B), and Ph(B) at least \vdP d 
ii p < d(h*) and at least \vd{p/D) d if d(h*) < p < Dd(h*), from which it follows that 

P h (AnL h )>Xv d (p/D) d . (43) 

If h<h*, then, by (40), 

P h (Af]L h )> UP{A) > UCp d ' . (44) 

We now consider the other case of A centered in Lh — L. Without loss of generality, we will only need 
to analyze the situation in which the _ball A is centered at dLh ■ Indeed, for each ball A of any radius 
and center in Lh, there exists a ball Ah with the same radius and center in dLh such that 

Ph{A) > Ph(Ah). 
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If p > Dd(h) (so that h < h*), the same arguments used in case 2 above apply. Thus, suppose 
p < Dd(h). By assumption (G), for each h, dLh is a (d— l)-dimensional closed embedded submanifold 
of M. d . Thus, by a straightforward adaptation of Proposition A.l in Biau et al. (2007), there exists a 
p(h) such that for each r < p(h) and each ball A of radius r and center in dL^, there exists a ball B 
of radius cr such that ficiflii,. Then, there exists a constant r > such that 

0<h<h p(fl) 

As a result, there exists a ball B C AnLh of radius j^cp if p(h) < p < d(h) and of radius cp if p < p(h), 
so that 

Ph(A n L h ) > P h {B) > Xv d ( T /Dc) d p d . (45) 
The claim follows from taking the minimal value of the constants in (41), (42), (43), (44) and (45). 
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